In this week’s lab, the main goal is to learn how to define and make effective plots to answer questions about data. On the due date, turn in your Rmd file and the html product.
Open your project for this class. Make sure all your work is done relative to this project.
Open the lab4.Rmd file provided with the instructions. You can edit this file and add your answers to questions in this document.
In each of these plots from previous labs, write out the grammar that defines the mapping of the data to the display:
(3pts)
DATA: PISA_oz_sub
AESTHETICS/MAPPINGS: x=PV1MATH
GEOM: histogram
STAT: bin
POSITION: identity
COORDINATE: cartesian
FACET: none
DATA: PISA_oz_sub, without missing values
AESTHETICS/MAPPINGS: x=ST27Q02
GEOM: bar
STAT: count
POSITION: identity
COORDINATE: cartesian
FACET: none
DATA: PISA_oz_sub, without missing values in ST27Q02
AESTHETICS/MAPPINGS: x=ST27Q02, y=PV1MATH
GEOM: boxplot
STAT: boxplot
POSITION: dodge
COORDINATE: cartesian
FACET: none
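Each term of this grammar corresponds closely to a piece of ggplot2 code. The following is a minimal sketch rather than the lab's original code: the data frame is a small synthetic stand-in for PISA_oz_sub, and the category labels for ST27Q02 are invented for illustration.

```r
library(ggplot2)

# Synthetic stand-in for PISA_oz_sub (assumption: the real data is not reproduced here)
set.seed(1)
PISA_oz_sub <- data.frame(
  ST27Q02 = sample(c("None", "One", "Two", NA), 200, replace = TRUE),
  PV1MATH = rnorm(200, mean = 500, sd = 90)
)

# DATA: drop the rows with missing values in ST27Q02
pisa_complete <- subset(PISA_oz_sub, !is.na(ST27Q02))

p <- ggplot(pisa_complete, aes(x = ST27Q02, y = PV1MATH)) +  # AESTHETICS/MAPPINGS
  geom_boxplot(stat = "boxplot", position = "dodge") +       # GEOM, STAT, POSITION
  coord_cartesian() +                                        # COORDINATE
  facet_null()                                               # FACET: none
p
```

The stat, position, coordinate and facet terms are written out only to mirror the grammar; the plot is normally specified with `geom_boxplot()` alone.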
DATA: ningaloo_freqwhales
LAYER 1: (not important to say anything more than this)
DATA: map
LAYER 2:
DATA: ningaloo_freqwhales
AESTHETICS/MAPPINGS: x=Longitude, y=Latitude,
colour=`Marked Individual`
GEOM: point
STAT: identity
POSITION: identity
COORDINATE: cartesian
FACET: `Marked Individual`
LAYER 3:
DATA: ningaloo_freqwhales
AESTHETICS/MAPPINGS: x=Longitude, y=Latitude,
colour=`Marked Individual`
GEOM: line
STAT: identity
POSITION: identity
COORDINATE: cartesian
FACET: `Marked Individual`
For the hotel booking data in the file budapest.csv, make a plot to answer this question: “How far ahead of the check-in date do people typically search for a hotel room?”, and write a sentence or two answering it. In the last lab you did the wrangling necessary to get the data into shape. You may also need to do a bit more cleaning to remove very strange differences, such as those less than 0 or searches made more than a year ahead. (EXTRA CREDIT POINT: Explain how these odd values arose.)
A univariate display of the distribution of differences is the best way to answer this question. Using a histogram, or a density plot, is better than a boxplot, because it allows for a more detailed glimpse of the distribution.
The distribution is heavily right-skewed. Most searches are made within a month of the check-in date, and very few are made more than 3 months ahead. There appear to be two small modes, at about 2 months and 3 months ahead.
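A display like this could be produced along the following lines. This is a sketch on toy data, not the wrangling code from the previous lab: SRCH_BEGIN_USE_DATE is the check-in column named in this lab, while `search_date` and `days_ahead` are hypothetical names chosen for illustration.

```r
library(ggplot2)

# Toy stand-in for budapest.csv; `search_date` is a hypothetical column name
budapest <- data.frame(
  search_date = as.Date(c("2014-03-01", "2014-03-10", "2014-04-02",
                          "2014-04-15", "2014-05-01", "2014-05-20")),
  SRCH_BEGIN_USE_DATE = as.Date(c("2014-03-08", "2014-04-01", "1900-01-01",
                                  "2014-05-30", "2020-10-21", "2014-05-25"))
)

# Days between the search and the requested check-in date
budapest$days_ahead <- as.numeric(budapest$SRCH_BEGIN_USE_DATE - budapest$search_date)

# Keep only plausible values: not negative, and not more than a year ahead
budapest_clean <- subset(budapest, days_ahead >= 0 & days_ahead <= 365)

p <- ggplot(budapest_clean, aes(x = days_ahead)) +
  geom_histogram(binwidth = 7) +   # one bin per week
  xlab("Days between search and check-in")
p
```

A weekly bin width is only one reasonable choice; swapping `geom_histogram()` for `geom_density()` gives the density version of the same display.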
One possible reason for the negative values is a time zone error, where the person is searching from a different part of the globe. Searches more than a year ahead are possible, but fairly rare. Looking at the search dates corresponding to these odd times reveals that many of the negatives come from "1900-01-01" being the SRCH_BEGIN_USE_DATE, and the positives from "2020-10-21" being the SRCH_BEGIN_USE_DATE. These may be defaults that the booking web site uses to kick off quotes, or values filled in automatically when the date is missing from the click-through data.
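One way to find placeholder dates like these is to tabulate the check-in dates among the odd rows only. The sketch below uses a few made-up rows; `search_date` and `days_ahead` are hypothetical names, and only SRCH_BEGIN_USE_DATE comes from the data description.

```r
# Toy rows; in the lab this would be the wrangled budapest data
searches <- data.frame(
  search_date = as.Date(c("2014-03-01", "2014-03-05", "2014-03-09",
                          "2014-04-02", "2014-04-20")),
  SRCH_BEGIN_USE_DATE = as.Date(c("1900-01-01", "2014-03-20", "1900-01-01",
                                  "2020-10-21", "2014-05-01"))
)
searches$days_ahead <- as.numeric(searches$SRCH_BEGIN_USE_DATE - searches$search_date)

# Rows with negative differences or searches more than a year ahead
odd <- subset(searches, days_ahead < 0 | days_ahead > 365)

# Which check-in dates produce the odd values, most frequent first
odd_counts <- sort(table(format(odd$SRCH_BEGIN_USE_DATE)), decreasing = TRUE)
odd_counts
```

A small number of dates accounting for most of the odd rows points to default or fill-in values rather than genuine searches.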
For the 2015 PISA results, design plots to answer these questions, explain your reasons for the design, and write an answer to the question.
The median score and the middle 50% of scores differ between the school types, but the overall range is the same. Based on the median, independent schools score best, followed by Catholic and then government schools. However, the top scores at each school type are basically the same, and so are the worst scores.
There is not much difference. On the median, students score a little better if they were born in May or June, and a little worse if born in April.
A mosaic plot is the easiest way to examine the relationship between these two variables, but a jittered dotplot is a good start. There is a positive association between the variables. Overwhelmingly, if a household has no TVs it also has no cars, and if it has 4 TVs then it is almost certain to have at least 3 cars. Most households have at least 2 TVs.
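A comparison of this kind can be sketched with side-by-side boxplots ordered by the median. The data below is simulated, and the column names `school_type` and `score` are hypothetical stand-ins for the PISA variables.

```r
library(ggplot2)

# Simulated scores for three school types (hypothetical names and values)
set.seed(2)
pisa <- data.frame(
  school_type = rep(c("Government", "Catholic", "Independent"), each = 100),
  score = c(rnorm(100, 490, 95), rnorm(100, 510, 90), rnorm(100, 530, 90))
)

# Median score for each school type
medians <- tapply(pisa$score, pisa$school_type, median)
medians

# Order the categories by median so the boxes read from lowest to highest
pisa$school_type <- reorder(pisa$school_type, pisa$score, FUN = median)

p <- ggplot(pisa, aes(x = school_type, y = score)) +
  geom_boxplot()
p
```

Ordering by the median makes the ranking of the groups visible at a glance, while the whiskers still show whether the overall ranges differ.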
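Both displays can be sketched as follows, using simulated household counts. The column names `tvs` and `cars` are hypothetical, and the mosaic plot uses `mosaicplot()` from base R graphics rather than a ggplot2 layer.

```r
library(ggplot2)

# Simulated counts: cars loosely follow the number of TVs (hypothetical data)
set.seed(3)
tvs <- sample(0:4, 300, replace = TRUE)
cars <- pmin(pmax(tvs + sample(-1:1, 300, replace = TRUE), 0), 3)
households <- data.frame(tvs = factor(tvs), cars = factor(cars))

# Jittered dotplot: jitter separates the points that would otherwise overlap
p <- ggplot(households, aes(x = tvs, y = cars)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.5)
p

# Mosaic plot: tile areas are proportional to the cell counts
counts <- table(households$tvs, households$cars, dnn = c("TVs", "cars"))
mosaicplot(counts, main = "Cars by number of TVs")
```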
Side-by-side boxplots are a good choice here, because internet_use is discrete. I do like a jittered dotplot, with the means overlaid, though, because it shows the skewness clearly.
There is not much association between the two variables. Study time dips a little, on average, in categories 3, 4 and 5 of internet use, that is, as usage increases from 30 minutes to 4 hours. But it increases a bit again with internet use over 4 hours!
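The jittered dotplot with overlaid means might look like the sketch below. The data is simulated; `internet_use` is the variable named above, while `study_time` and its values are hypothetical. The `fun` argument of `stat_summary()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Simulated data: seven internet-use categories, right-skewed study time
set.seed(4)
students <- data.frame(
  internet_use = factor(sample(1:7, 400, replace = TRUE)),
  study_time = rexp(400, rate = 1 / 10)   # hypothetical hours
)

p <- ggplot(students, aes(x = internet_use, y = study_time)) +
  # Jitter horizontally only, so the study times themselves are not distorted
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  # Overlay the mean of each category as a large red point
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)
p
```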
Internet use categories (code: label)
1: No time
2: 1-30 minutes per day
3: 31-60 minutes per day
4: Between 1 hour and 2 hours per day
5: Between 2 hours and 4 hours per day
6: Between 4 hours and 6 hours per day
7: More than 6 hours per day
95: Valid Skip
96: Not Reached
97: Not Applicable
98: Invalid
99: No Response
You need to make a scatterplot of these two variables. It's a bit messy, and there is little data out in the tails. Focus on the range of belong around zero to examine the relationship. Add a smoother or a linear model to highlight the trend.
There is a weak relationship, but on average science scores increase a little (about 50 points) as the belong score increases.
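A sketch of this design on simulated data follows. `belong` is the variable named above; the column name `science`, the simulated slope and the zoom limits of -2 to 2 are assumptions made for illustration.

```r
library(ggplot2)

# Simulated weak positive relationship (hypothetical values)
set.seed(5)
belong <- rnorm(500)
pisa <- data.frame(
  belong = belong,
  science = 500 + 10 * belong + rnorm(500, sd = 90)
)

p <- ggplot(pisa, aes(x = belong, y = science)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x) +  # linear model fit
  coord_cartesian(xlim = c(-2, 2))               # zoom in without dropping data
p

# The fitted slope gives the average change in score per unit of belong
fit <- lm(science ~ belong, data = pisa)
coef(fit)["belong"]
```

`coord_cartesian()` only zooms the view, so the model is still fitted to all the points; filtering the data first would instead fit the model to the central range only.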
In this part, we are going to take a look at historical weather for Melbourne. Download the latest data for the Melbourne airport station, ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly. You may need to use Google Chrome. Using the same wrangling code from the previous lab, get the data in shape, and make plots to answer these questions:
There is a lot of data, so using a smoother is a good way to answer this question. A smoother allows for nonlinear association, which places fewer constraints on the relationship than a linear model fit. Using year as the x variable gives enough temporal resolution to do the smoothing and examine the long-term trend. (A slight adjustment to the data is needed: 2017 is not complete, so only records up to 2016 should be used when examining the data on the yearly scale.)
Since 1970, the maximum temperature has increased by roughly 2 degrees.
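The yearly smoothing could be sketched as below. The daily series is simulated, with an invented warming trend, and the column names `date`, `year` and `tmax` are hypothetical rather than taken from the wrangling code.

```r
library(ggplot2)

# Simulated daily maximum temperatures, 1970 to mid-2017 (hypothetical values)
set.seed(6)
dates <- seq(as.Date("1970-01-01"), as.Date("2017-06-30"), by = "day")
year <- as.integer(format(dates, "%Y"))
doy <- as.integer(format(dates, "%j"))
weather <- data.frame(
  date = dates,
  year = year,
  tmax = 20 + 7 * cos(2 * pi * doy / 365) + 0.04 * (year - 1970) +
    rnorm(length(dates), sd = 3)
)

# 2017 is incomplete, so keep only the full years before averaging
complete_years <- subset(weather, year <= 2016)
yearly <- aggregate(tmax ~ year, data = complete_years, FUN = mean)

p <- ggplot(yearly, aes(x = year, y = tmax)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p
```

Leaving in a partial year would bias its average towards whichever season had been recorded, which is why the incomplete year is dropped before aggregating.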
The easiest way to address this question is to use side-by-side boxplots.
Generally max temperatures are lower in the winter than the summer months. The variability is smaller in the winter months. Winter max temperatures are always above 5, and the highest max temperature, about 46, occurred in a February.
The best way to answer this question is to use a smoother again. It is best to look at precipitation on a sqrt scale, though, because it is heavily right-skewed.
It does look like there is a slight downward trend, since 1970.
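A smoother on a square-root scale can be sketched as follows. The monthly totals are simulated from a right-skewed distribution, and the column names `month` and `prcp` are hypothetical.

```r
library(ggplot2)

# Simulated right-skewed monthly precipitation totals (hypothetical values)
set.seed(7)
months <- seq(as.Date("1970-01-01"), as.Date("2016-12-01"), by = "month")
rain <- data.frame(
  month = months,
  prcp = rgamma(length(months), shape = 1.5, scale = 30)
)

p <- ggplot(rain, aes(x = month, y = prcp)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_y_sqrt()   # values are transformed before the smoother is fitted
p
```

Because the scale transformation is applied before the statistic is computed, the smoother is fitted to the square roots, so the few very wet months pull on the curve less than they would on the raw scale.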
This is tricky! The neatest way to look at this is to calculate the historical average global max and min for June, and use a vertical bar to represent this. Then overlay vertical bars for this year's temperature range. The focus is on the range of temperatures each day. (You could also think about using the average of the historical minimums and maximums to represent the past, or the 10th and 90th percentiles.)
We would expect the historical bars to be wider, since they represent more data. The bars for this year should be roughly in the middle, and if they are shifted in one direction or the other, it says this year was warmer or cooler than expected. For many of the days this year, particularly in the middle of the month, it was warmer than we might expect. In the latter part of the month, the temperatures were closer to the historical averages, and perhaps less varied.
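The overlaid range bars could be drawn as in the sketch below. All values are simulated, and the column names (`hist_min`, `hist_max`, `this_min`, `this_max`) are hypothetical summaries that would come from the wrangled weather data. The `linewidth` argument assumes ggplot2 3.4.0 or later; earlier versions use `size` for line thickness.

```r
library(ggplot2)

# Simulated daily ranges for June: historical summary and this year
set.seed(8)
june <- data.frame(
  day = 1:30,
  hist_min = 6 + rnorm(30, sd = 0.5),
  this_min = 7 + rnorm(30, sd = 1.5)
)
june$hist_max <- june$hist_min + 8   # wide historical range
june$this_max <- june$this_min + 5   # narrower range for this year

p <- ggplot(june, aes(x = day)) +
  # Wide grey bars: historical range for each day
  geom_linerange(aes(ymin = hist_min, ymax = hist_max),
                 colour = "grey70", linewidth = 4) +
  # Thin red bars on top: this year's range for each day
  geom_linerange(aes(ymin = this_min, ymax = this_max),
                 colour = "red", linewidth = 1.5) +
  labs(x = "Day of June", y = "Temperature")
p
```

Drawing the historical layer first keeps this year's bars visible on top, so any shift up or down relative to the grey bars is easy to see.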